January 2019
Describes the outcome of a single trial ("success" or "failure") where the probability of success is p
Example: flipping a fair coin once
\[Pr(X=\text{Heads}) = \frac{1}{2} = 0.5 = p \]
\[Pr(X=\text{Tails}) = \frac{1}{2} = 0.5 = 1 - p \]
\[ p + (1-p) = 1 \]
\[ Pr(\text{X=H and Y=H}) = p*p = p^2 \]
\[ Pr(\text{X=H and Y=T}) = p*(1-p) \]
\[ Pr(\text{X=T and Y=H}) = (1-p)*p \]
\[ Pr(\text{X=T and Y=T}) = (1-p)*(1-p) = (1-p)^2 \]
H and T can occur in any order:
\[ \text{Pr(H and T)} = \]
\[ \text{Pr(X=H and Y=T) or Pr(X=T and Y=H)} = \]
\[ p*(1-p) + (1-p)*p = 2p(1-p) \]
For a fair coin (p = 0.5) this is 0.5.
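As a quick check, the four ordered outcomes of two independent flips can be enumerated in R (p = 0.5 here is illustrative):

```r
# Two independent coin flips with Pr(Heads) = p: the four ordered outcomes
p <- 0.5  # a fair coin
pr <- c(HH = p * p,
        HT = p * (1 - p),
        TH = (1 - p) * p,
        TT = (1 - p) * (1 - p))

sum(pr)               # the four outcomes are exhaustive, so this sums to 1

# "One head and one tail, in any order" combines two ordered outcomes
pr["HT"] + pr["TH"]   # 2 * p * (1 - p) = 0.5 for a fair coin
```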
A binomial distribution results from the combination of several independent Bernoulli events
The distribution of probabilities for each combination of outcomes is
\[\large f(k) = {n \choose k} p^{k} (1-p)^{n-k}\]
where
- \(n\) is the total number of trials
- \(k\) is the number of successes
- \(p\) is the probability of success
- \(q = 1 - p\) is the probability of failure

This part is the probability of any one particular sequence of \(k\) successes and \(n-k\) failures:
\[\large p^{k} (1-p)^{n-k}\]
This part (called the binomial coefficient) is the number of different orderings in which \(k\) successes can occur among \(n\) trials:
\[\large {n \choose k}\]
Together they equal the probability of a specified number of successes
\[\large f(k) = {n \choose k} p^{k} (1-p)^{n-k}\]
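A minimal R sketch of the binomial formula above, checked against the built-in `dbinom()`; the values of n and p are illustrative:

```r
# Binomial pmf written out from f(k) = choose(n, k) * p^k * (1-p)^(n-k)
binom_pmf <- function(k, n, p) {
  choose(n, k) * p^k * (1 - p)^(n - k)
}

n <- 10   # total number of trials (illustrative)
p <- 0.5  # probability of success on each trial

# Compare the hand-written formula with R's dbinom() for every possible k
k <- 0:n
all.equal(binom_pmf(k, n, p), dbinom(k, n, p))  # TRUE

# Probabilities over all possible outcomes sum to 1
sum(binom_pmf(k, n, p))  # 1
```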
Another common situation in biology is counting how many times a discrete event occurs within a fixed amount of observation (time, space, trials)
Pr(Y=r) is the probability that the number of occurrences of an event Y equals a count r in the total number of trials:
\[Pr(Y=r) = \frac{e^{-\lambda}\lambda^r}{r!}\]
where \(\lambda\) (written \(\mu\) in some texts) is the mean number of occurrences
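The same kind of check for the Poisson formula against R's `dpois()`; the value of lambda is illustrative:

```r
# Poisson pmf written out from Pr(Y = r) = exp(-lambda) * lambda^r / r!
pois_pmf <- function(r, lambda) {
  exp(-lambda) * lambda^r / factorial(r)
}

lambda <- 3  # mean number of occurrences (illustrative)
r <- 0:10

# The hand-written formula matches R's built-in dpois()
all.equal(pois_pmf(r, lambda), dpois(r, lambda))  # TRUE
```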
The normal probability density function is
\[\large f(x) = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}}\]
where \[\large \pi \approx 3.14159\]
\[\large e \approx 2.71828\]
To write that a variable (v) is distributed as a normal distribution with mean \(\mu\) and variance \(\sigma^2\), we write the following:
\[\large v \sim \mathcal{N} (\mu,\sigma^2)\]
Estimate of the mean from a single sample
\[\Large \bar{x} = \frac{1}{n}\sum_{i=1}^{n}{x_i} \]
Estimate of the variance from a single sample
\[\Large s^2 = \frac{1}{n-1}\sum_{i=1}^{n}{(x_i - \bar{x})^2} \]
The standard deviation is the square root of the variance
\[\Large s = \sqrt{s^2} \]
\[\huge z_i = \frac{(x_i - \bar{x})}{s}\]

## R Interlude | Complete Exercises 3.1-3.2
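The estimators above can be written out directly in R and compared with the built-ins; the data vector is hypothetical:

```r
# Hypothetical measurements
x <- c(4.1, 5.3, 6.2, 4.8, 5.9, 5.5)
n <- length(x)

x_bar <- sum(x) / n                    # sample mean
s2    <- sum((x - x_bar)^2) / (n - 1)  # sample variance (n - 1 denominator)
s     <- sqrt(s2)                      # sample standard deviation

# These match R's built-in functions
all.equal(x_bar, mean(x))
all.equal(s2, var(x))
all.equal(s, sd(x))

# z-scores: each value centered by the mean and scaled by the SD
z <- (x - x_bar) / s
all.equal(z, as.numeric(scale(x)))
```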
\(H_0\) : Null hypothesis : Ponderosa pine trees are the same height on average as Douglas fir trees
\(H_A\) : Alternative Hypothesis: Ponderosa pine trees are not the same height on average as Douglas fir trees
What is the probability that we would reject a true null hypothesis?
What is the probability that we would accept a false null hypothesis?
How do we decide when to reject a null hypothesis and support an alternative?
What can we conclude if we fail to reject a null hypothesis?
What parameter estimates of distributions are important to test hypotheses?
\(H_0\) : Null hypothesis : Ponderosa pine trees are the same height on average as Douglas fir trees
\[H_0 : \mu_1 = \mu_2\]
\(H_A\) : Alternative Hypothesis: Ponderosa pine trees are not the same height on average as Douglas fir trees
\[H_A : \mu_1 \neq \mu_2\]
\[\huge t = \frac{(\bar{y}_1-\bar{y}_2)}{s_{\bar{y}_1-\bar{y}_2}} \]
where
\[\Large s_{\bar{y}_1-\bar{y}_2} = \sqrt{s_p^2\left(\frac{1}{n_1}+\frac{1}{n_2}\right)} \]
which is the standard error of the difference between the means (\(s_p^2\) is the pooled sample variance)
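A sketch of the t statistic computed by hand, assuming the pooled-variance (classic Student's) form of the standard error; the height values below are hypothetical:

```r
# Hypothetical tree heights (m); names follow the hypotheses above
ponderosa <- c(24.1, 27.3, 22.8, 25.6, 26.0, 23.9)
doug_fir  <- c(28.4, 30.1, 27.7, 29.5, 31.2, 28.8)

n1 <- length(ponderosa)
n2 <- length(doug_fir)

# Pooled variance, then the standard error of the difference between means
sp2     <- ((n1 - 1) * var(ponderosa) + (n2 - 1) * var(doug_fir)) / (n1 + n2 - 2)
se_diff <- sqrt(sp2 * (1 / n1 + 1 / n2))

t_stat <- (mean(ponderosa) - mean(doug_fir)) / se_diff

# Matches R's t.test() with var.equal = TRUE (Student's two-sample t)
fit <- t.test(ponderosa, doug_fir, var.equal = TRUE)
all.equal(t_stat, unname(fit$statistic))
```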
- Mean of x in each case: 9 (exact)
- Variance of x in each case: 11 (exact)
- Mean of y in each case: 7.50 (to 2 decimal places)
- Variance of y in each case: 4.122 or 4.127 (to 3 decimal places)
- Correlation between x and y in each case: 0.816 (to 3 decimal places)
- Linear regression line in each case: \[ y = 3.00 + 0.50x\]
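These summary statistics describe Anscombe's quartet, which ships with R as the data frame `anscombe` (columns x1..x4, y1..y4), so they can be verified directly:

```r
# Anscombe's quartet: four data sets with near-identical summary statistics
data(anscombe)

sapply(anscombe[, 1:4], mean)            # each x mean is 9
sapply(anscombe[, 1:4], var)             # each x variance is 11
round(sapply(anscombe[, 5:8], mean), 2)  # each y mean is ~7.50
round(sapply(anscombe[, 5:8], var), 3)   # each y variance is ~4.12

# Correlation and fitted line are also nearly identical in all four sets
round(cor(anscombe$x1, anscombe$y1), 3)  # 0.816
coef(lm(y1 ~ x1, data = anscombe))       # intercept ~3.00, slope ~0.50
```

Plotting the four pairs shows why the identical statistics are misleading: the relationships look completely different.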
General Linear Model (GLM) - a continuous response variable with one or more continuous predictor variables
General Linear Mixed Model (GLMM) - a continuous response variable with a mix of continuous and categorical predictor variables
Generalized Linear Model - a linear model that doesn't assume normality of the response (we'll get to this later)
Generalized Additive Model (GAM) - a model that doesn't assume linearity (we won't get to this)
All can be written in the form
response variable = intercept + (explanatory_variables) + random_error
in the general form:
\[ Y=\beta_0 +\beta_1*X_1 + \beta_2*X_2 +... + error\]
where \(\beta_0, \beta_1, \beta_2, ....\) are the parameters of the linear model
All of these will include the intercept
Y~X, Y~1+X, Y~X+1
All of these will exclude the intercept
Y~-1+X, Y~X-1
Need to fit the model and then 'read' the output
trial_lm <- lm(Y ~ X)
summary(trial_lm)
\[H_0 : \beta_0 = 0\] \[H_0 : \beta_1 = 0\]
full model - \(y_i = \beta_0 + \beta_1*x_i + error_i\)
reduced model - \(y_i = \beta_0 + 0*x_i + error_i\)
Estimation of the variation that is explained by the model (SS_model)
SS_model = SS_total(reduced model) - SS_residual(full model)
The variation that is unexplained by the model (SS_residual)
SS_residual(full model)
\[r^2 = SS_{regression}/SS_{total} = 1 - (SS_{residual}/SS_{total})\]
or
\[r^2 = 1 - (SS_{residual(full)}/SS_{total(reduced)})\]
which is the proportion of the variance in Y that is explained by X
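A sketch of this sums-of-squares decomposition using simulated data (the variable names and coefficients are illustrative):

```r
# Simulated data: a linear relationship plus noise
set.seed(1)
X <- 1:20
Y <- 2 + 0.8 * X + rnorm(20, sd = 2)

full    <- lm(Y ~ X)  # full model:    beta0 + beta1 * X
reduced <- lm(Y ~ 1)  # reduced model: intercept only (slope fixed at 0)

ss_total    <- sum(residuals(reduced)^2)  # SS_total (reduced model)
ss_residual <- sum(residuals(full)^2)     # SS_residual (full model)
ss_model    <- ss_total - ss_residual     # variation explained by the model

# r^2 computed from the sums of squares matches lm's reported R-squared
r2 <- 1 - ss_residual / ss_total
all.equal(r2, summary(full)$r.squared)
```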
\[\beta_{YX}=\rho_{YX}*\sigma_Y/\sigma_X\] \[b_{YX} = r_{YX}*S_Y/S_X\]
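The relationship between the regression slope and the correlation coefficient can be checked numerically (simulated data, illustrative names):

```r
# b_YX = r_YX * s_Y / s_X, checked against the slope reported by lm()
set.seed(42)
x <- rnorm(30)
y <- 1.5 * x + rnorm(30)

b_from_cor <- cor(y, x) * sd(y) / sd(x)     # slope from correlation and SDs
b_from_lm  <- unname(coef(lm(y ~ x))["x"])  # slope estimated by lm()

all.equal(b_from_cor, b_from_lm)
```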
\[y_i = \beta_0 + \beta_1 * x_i + \epsilon_i\]
What if your residuals aren’t normal because of outliers?
Nonparametric methods exist, but these don’t provide parameter estimates with CIs.
Robust regression (rlm)
Randomization tests
Leverage - a measure of how much of an outlier each point is in x-space (on x-axis) and thus only applies to the predictor variable. (Values > 2*(2/n) for simple regression are cause for concern)
Residuals - As the residuals are the differences between the observed and predicted values along a vertical plane, they provide a measure of how much of an outlier each point is in y-space (on y-axis). The patterns of residuals against predicted y values (residual plot) are also useful diagnostic tools for investigating linearity and homogeneity of variance assumptions
Cook’s D statistic is a measure of the influence of each point on the fitted model (estimated slope) and incorporates both leverage and residuals. Values ≥ 1 (or even approaching 1) correspond to highly influential observations.
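A sketch of these three diagnostics in base R, using simulated data with one deliberate outlier in x-space:

```r
# Leverage, residuals, and Cook's D for a simple regression
set.seed(7)
x <- c(rnorm(19), 6)  # the last point is an outlier in x-space
y <- 2 + 0.5 * x + rnorm(20, sd = 0.5)
fit <- lm(y ~ x)

lev <- hatvalues(fit)       # leverage: outlyingness in x-space
res <- rstandard(fit)       # standardized residuals: outlyingness in y-space
cd  <- cooks.distance(fit)  # Cook's D: influence (combines leverage + residual)

n <- length(x)
which(lev > 2 * (2 / n))  # flag high-leverage points (rule of thumb above)
which(cd >= 1)            # flag highly influential points
```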
To develop a better predictive model than is possible from models based on single independent variables.
To investigate the relative individual effects of each of the multiple independent variables above and beyond the effects of the other variables.
The individual effects of each of the predictor variables on the response variable can be depicted by single partial regression lines.
The slope of any single partial regression line (partial regression slope) thereby represents the rate of change or effect of that specific predictor variable (holding all the other predictor variables constant to their respective mean values) on the response variable.
Additive model \[y_i = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + ... + \beta_jx_{ij} + \epsilon_i\]
Multiplicative model (with two predictors) \[y_i = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \beta_3x_{i1}x_{i2} + \epsilon_i\]
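In R's formula notation the multiplicative (interaction) term is written with `*` or `:`; a sketch with simulated data (variable names are illustrative):

```r
# Additive vs multiplicative (interaction) models in R formula notation
set.seed(3)
x1 <- rnorm(40)
x2 <- rnorm(40)
y  <- 1 + 2 * x1 - x2 + 0.5 * x1 * x2 + rnorm(40)

additive       <- lm(y ~ x1 + x2)  # beta0 + beta1*x1 + beta2*x2
multiplicative <- lm(y ~ x1 * x2)  # adds the beta3*x1*x2 interaction term

# y ~ x1 * x2 expands to y ~ x1 + x2 + x1:x2
names(coef(multiplicative))  # "(Intercept)" "x1" "x2" "x1:x2"

# Compare the two fits: does adding the interaction improve the model?
anova(additive, multiplicative)
```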
library(car)
scatterplotMatrix(~ var1 + var2 + var3, diag = "boxplot")
From Langford, D. J.,et al. 2006. Science 312: 1967-1970
In words:
stretching = intercept + treatment
- The model statement includes a response variable, a constant, and an explanatory variable.
- The only difference with regression is that here the explanatory variable is categorical.
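A sketch of such a model, using hypothetical stretching scores for two treatment groups (the group names and values are invented for illustration):

```r
# A linear model with a categorical explanatory variable
set.seed(11)
treatment  <- factor(rep(c("control", "pain"), each = 15))
stretching <- c(rnorm(15, mean = 2), rnorm(15, mean = 5))

fit <- lm(stretching ~ treatment)
coef(fit)
# (Intercept)   = mean of the reference level ("control")
# treatmentpain = difference between the "pain" and "control" group means

all.equal(unname(coef(fit)[1]), mean(stretching[treatment == "control"]))
all.equal(unname(coef(fit)[2]),
          mean(stretching[treatment == "pain"]) -
          mean(stretching[treatment == "control"]))
```

This is the same machinery as regression: the categorical predictor is converted internally into 0/1 indicator variables.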
Survival of climbers of Mount Everest is higher for individuals taking supplemental oxygen than those who don’t.
Why?
The function is to objectively present your key results, without interpretation, in an orderly and logical sequence using both text and illustrative materials (Tables and Figures).
The results section always begins with text, reporting the key results and referring to figures and tables as you proceed.
The text of the Results section should be crafted to follow this sequence and highlight the evidence needed to answer the questions/hypotheses you investigated.
Important negative results should be reported, too. Authors usually write the text of the results section based upon the sequence of Tables and Figures.
May appear either in the text (usually parenthetically) or in the relevant Tables or Figures (in the legend or as footnotes to the Table or Figure). Each Table and Figure must be referenced in the text portion of the results, and you must tell the reader what the key result(s) is that each Table or Figure conveys.
For example, suppose you asked the question, "Is the average height of male students the same as female students in a pool of randomly selected Biology majors?" You would first collect height data from large random samples of male and female students. You would then calculate the descriptive statistics for those samples (mean, SD, n, range, etc) and plot these numbers. Suppose you found that male Biology majors are, on average, 12.5 cm taller than female majors; this is the answer to the question. Notice that the outcome of a statistical analysis is not a key result, but rather an analytical tool that helps us understand what is our key result.
Report your results so as to provide as much information as possible to the reader about the nature of differences or relationships.
For example, if you are testing for differences among groups, and you find a significant difference, it is not sufficient to simply report that "groups A and B were significantly different". How are they different? How much are they different?
It is much more informative to say something like, "Group A individuals were 23% larger than those in Group B", or, "Group B pups gained weight at twice the rate of Group A pups."
Report the direction of differences (greater, larger, smaller, etc) and the magnitude of differences (% difference, how many times, etc.) whenever possible.
Statistical test summaries (test name, p-value) are usually reported parenthetically in conjunction with the biological results they support. This parenthetical reference should include the statistical test used, the test statistic value, the degrees of freedom, and the level of significance.
For example, if you found that the mean height of male Biology majors was significantly larger than that of female Biology majors, you might report this result together with your statistical conclusion as follows:
If the summary statistics are shown in a figure, the sentence above need not report them specifically, but must include a reference to the figure where they may be seen:
"Graphical excellence is that which gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space"
— Edward Tufte
Draw graphical elements clearly, minimizing clutter
Represent magnitudes honestly and accurately
“Graphical excellence begins with telling the truth about the data” – Tufte 1983
“Graphical excellence consists of complex ideas communicated with clarity, precision and efficiency” – Tufte 1983